In
this lecture we give an overview
of Text Mining and Analytics.
First, let's define the term text mining,
and the term text analytics.
The title of this course is
called Text Mining and Analytics.
But the two terms text mining, and text
analytics are actually roughly the same.
So we are not going to
really distinguish them, and
we're going to use them interchangeably.
But the reason that we have chosen to use
both terms in the title is because
there is also some subtle difference,
if you look at the two phrases literally.
Mining puts more emphasis on the process.
So it gives us an algorithmic
view of the problem.
Analytics, on the other hand,
puts more emphasis on the result,
or on having a problem in mind.
We are going to look at text
data to help us solve a problem.
But again as I said, we can treat
these two terms roughly the same.
And I think in the literature
you probably will find the same.
So we're not going to really
distinguish them in the course.
Both text mining and
text analytics mean that we
want to turn text data into high quality
information, or actionable knowledge.
So in both cases, we
have the problem of dealing with
a lot of text data, and we hope to
turn this text data into something more
useful to us than the raw text data.
And here we distinguish
two different results.
One is high-quality information,
the other is actionable knowledge.
Sometimes the boundary between
the two is not so clear.
But I also want to say a little bit about
these two different angles of
the result of text mining.
In the case of high quality information,
we refer to more
concise information about the topic,
which might be much easier for
humans to digest than the raw text data.
For example, you might face
a lot of reviews of a product.
A more concise form of information
would be a very concise summary
of the major opinions about
the features of the product.
For instance, positive opinions about,
let's say, the battery life of a laptop.
Now these kinds of results are very useful
to help people digest the text data.
And so this is to minimize human effort
in consuming text data, in some sense.
The other kind of output
is actionable knowledge.
Here we emphasize the utility
of the information or
knowledge we discover from text data.
It's actionable knowledge for some
decision problem, or some actions to take.
For example, we might be able to determine
which product is more appealing to us,
or a better choice for
a shopping decision.
Now, such an outcome could be
called actionable knowledge,
because a consumer can take the knowledge
and make a decision, and act on it.
So, in this case text mining supplies
knowledge for optimal decision making.
But again, the two are not so
clearly distinguished, so
we don't necessarily have
to make a distinction.
Text mining is also
related to text retrieval,
which is an essential component
in many text mining systems.
Now, text retrieval refers to
finding relevant information from
a large amount of text data.
I've taught another separate course
on text retrieval and search engines,
where we discussed various techniques for
text retrieval.
If you have taken that course,
you will find some overlap.
And it will be useful to know
the background of text retrieval
for understanding some of
the topics in text mining.
But, if you have not taken that course,
it's also fine because in this course
on text mining and analytics, we're
going to repeat some of the key concepts
that are relevant for text mining.
But they're at a high level, and
we also explain the relation between
text retrieval and text mining.
Text retrieval is very useful for
text mining in two ways.
First, text retrieval can be
a preprocessor for text mining,
meaning that it can help
us turn big text data into
a relatively small amount
of the most relevant text data,
which is often what's needed for
solving a particular problem.
And in this sense, text retrieval
also helps minimize human effort.
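To make the preprocessor idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function names `tokenize` and `retrieve`, the sample reviews, and the scoring by simple query-term counts are all hypothetical choices, not the retrieval techniques taught in this course.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, documents, top_k=2):
    """Return up to top_k documents with the highest query-term counts.

    Scoring by raw term counts is an illustrative assumption; real
    retrieval systems use more refined ranking functions.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        score = sum(counts[term] for term in query_terms)
        if score > 0:  # keep only documents that mention a query term
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Hypothetical "big" collection, reduced to the part relevant to a problem.
reviews = [
    "The battery life of this laptop is great.",
    "Shipping was slow and the box was damaged.",
    "Battery drains fast, but the screen is nice.",
    "I bought it as a gift.",
]
relevant = retrieve("battery life", reviews)
```

Only the small `relevant` subset would then be passed on to a mining step, which is the sense in which retrieval acts as a preprocessor.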
Second, text retrieval is also needed for
knowledge provenance.
And this roughly corresponds
to the interpretation of text
mining as turning text data
into actionable knowledge.
Once we find the patterns in text data, or
actionable knowledge, we generally
would have to verify the knowledge
by looking at the original text data.
So the users would have to have some text
retrieval support to go back to the original
text data to interpret the pattern or
to better understand the knowledge, or
to verify whether a pattern
is really reliable.
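One way to picture knowledge provenance is to keep, for every discovered pattern, a pointer back to the documents it came from. The sketch below is a rough illustration only: the "pattern" here is just a term that occurs in several documents, and the function name `mine_terms_with_provenance` and the sample sentences are hypothetical.

```python
import re
from collections import defaultdict

def mine_terms_with_provenance(documents, min_docs=2):
    """Map each term found in at least min_docs documents to those document ids."""
    sources = defaultdict(set)
    for doc_id, text in enumerate(documents):
        # Use a set so each document counts once per term.
        for term in set(re.findall(r"[a-z]+", text.lower())):
            sources[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sources.items() if len(ids) >= min_docs}

docs = [
    "Battery lasts all day.",
    "Great battery, poor keyboard.",
    "Keyboard feels cheap.",
]
patterns = mine_terms_with_provenance(docs)

# Go back to the original text behind the "battery" pattern to verify it.
evidence = [docs[i] for i in patterns["battery"]]
```

Because the document ids are stored with each pattern, a user can read the original sentences and judge whether the pattern is reliable.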
So this is a high level introduction
to the concept of text mining,
and the relationship between
text mining and retrieval.
Next, let's talk about text
data as a special kind of data.
Now it's interesting to
view text data as data
generated by humans as subjective sensors.
So, this slide shows an analogy
between text data and non-text data.
And between humans as
subjective sensors and
physical sensors,
such as a network sensor or a thermometer.
So in general a sensor would
monitor the real world in some way.
It would sense some signal
from the real world, and
then would report the signal as data,
in various forms.
For example, a thermometer would watch
the temperature of the real world and
then would report the temperature
in a particular format.
Similarly, a geo sensor would sense
the location and then report
the location specification, for
example, in the form of a longitude
value and a latitude value.
A network sensor would
monitor network traffic,
or activities in the network, and
report them in
some digital format of data.
Similarly we can think of
humans as subjective sensors
that observe the real world
from some perspective.
And then humans will express what they
have observed in the form of text data.
So, in this sense, a human is actually
a subjective sensor that would also
sense what's happening in the world and
then express what's observed in the form
of data, in this case, text data.
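The sensor analogy can be sketched as a small data structure in which physical sensors and humans report readings in the same shape. This is only an illustration: the `Reading` class, its field names, and the sample values are assumptions made for this sketch, not something shown on the slide.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Reading:
    source: str      # which sensor produced the reading
    data_type: str   # e.g. "numeric", "location", "text"
    value: Any       # the reported data itself

def thermometer(celsius):
    """A physical sensor reporting a temperature as a number."""
    return Reading("thermometer", "numeric", celsius)

def geo_sensor(longitude, latitude):
    """A physical sensor reporting a location as a (longitude, latitude) pair."""
    return Reading("geo sensor", "location", (longitude, latitude))

def human(observation):
    """A subjective sensor reporting what was observed as text."""
    return Reading("human", "text", observation)

# All readings share one shape, so they can be collected together.
readings = [
    thermometer(21.5),
    geo_sensor(-88.2, 40.1),
    human("It feels chilly in the lab today."),
]
text_readings = [r for r in readings if r.data_type == "text"]
```

The point of the sketch is only that text from humans sits in the same list as the physical measurements, differing in type rather than in kind.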
Now, looking at the text data in
this way has the advantage of being
able to integrate all
types of data together.
And that's indeed needed in
most data mining problems.
So here we are looking at
the general problem of data mining.
And in general we would be
dealing with a lot of data
about our world that
are related to a problem.
And in general we will be dealing with
both non-text data and text data.
And of course the non-text data
are usually produced by physical sensors.
And those non-text data can
also be in different formats,
such as numerical data, categorical
or relational data,
or multimedia data like video or speech.
So, these non-text data are often
very important in some problems.
But text data is also very important,
mostly because it contains
a lot of semantic content.
And it often contains
knowledge about the users,
especially preferences and
opinions of users.
But by treating text data as
the data observed from human sensors,
we can treat all this data
together in the same framework.
So the data mining problem is
basically to turn such data,
turn all the data, into actionable
knowledge so that we can take advantage
of it to change the real
world, of course for the better.
So this means the data mining problem is
basically taking a lot of data as input
and giving actionable knowledge as output.
Inside the data mining module,
you can also see
we have a number of different
kinds of mining algorithms.
And this is because, for
different kinds of data,
we generally need different algorithms for
mining the data.
For example,
video data might require computer
vision to understand video content.
And that would facilitate
more effective mining.
And we also have a lot of general
algorithms that are applicable
to all kinds of data and those algorithms,
of course, are very useful.
Although, for a particular kind of data,
we generally want to also
develop a special algorithm.
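The idea of a data mining module with general and specialized algorithms can be sketched roughly as follows. Everything here is hypothetical: the registry `MINERS`, the function names, and the use of simple counting as the "general algorithm" are assumptions chosen to keep the example short.

```python
import re
from collections import Counter

def count_items(items):
    """General-purpose step: count occurrences of any hashable items."""
    return Counter(items)

def mine_text(documents):
    """Text-specific step: tokenize first, then reuse the general counter."""
    tokens = [t for doc in documents for t in re.findall(r"[a-z]+", doc.lower())]
    return count_items(tokens)

def mine_categorical(values):
    """Categorical data needs no special preparation in this sketch."""
    return count_items(values)

# Hypothetical registry: one mining function per kind of data.
MINERS = {"text": mine_text, "categorical": mine_categorical}

def mine(data_type, data):
    """Dispatch the data to the miner registered for its type."""
    miner = MINERS.get(data_type)
    if miner is None:
        raise ValueError(f"no miner registered for {data_type!r}")
    return miner(data)

word_counts = mine("text", ["Good battery", "Bad battery"])
color_counts = mine("categorical", ["red", "blue", "red"])
```

Here the counting step is shared by all data types, while text gets its own tokenization step first, which mirrors the split between general algorithms and algorithms specialized for text.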
So this course will cover
specialized algorithms that
are particularly useful for
mining text data.